Abstract. Data visualization is one of the major applications of nonlinear
dimensionality reduction. From the information retrieval perspective, the
quality of a visualization can be evaluated by considering the extent to which
the neighborhood relation of each data point is maintained while the number of
unrelated points that are retrieved is minimized. This property can be
quantified as a trade-off between the mean precision and mean recall of the
visualization. While there have been some approaches to formulate the
visualization objective directly as a weighted sum of the precision and recall,
there is no systematic way to determine the optimal trade-off between these
two, nor a clear interpretation of the optimal value. In this paper, we
investigate the properties of α-divergence for information visualization,
focusing our attention on a particular range of α values. We show that the
minimization of the new cost function corresponds to maximizing a geometric
mean between precision and recall, parameterized by α. Contrary to some earlier
methods, no hand-tuning is needed; instead, we can rigorously estimate the
optimal value of α for a given input data. For this, we provide a statistical
framework using a novel distribution called Exponential Divergence with
Augmentation (EDA). Through an extensive set of experiments, we show that the
optimal value of α obtained by EDA corresponds to the optimal trade-off between
the precision and recall for a given data distribution.


1 Introduction 

Dimensionality reduction and, in particular, data visualization has been a
prominent research track for the past few decades as an important step in data
analysis. Many non-linear dimensionality reduction methods such as Sammon
mapping [1], Isomap [2], Locally Linear Embedding [3], Maximum Variance
Unfolding [4], and Laplacian Eigenmaps [5] have been proposed to overcome the
shortcomings of the simple linear methods in handling data lying on non-linear
manifolds. Although all these methods have been successfully applied to several
artificial as well as real-world datasets, they have been designed primarily
for unraveling the underlying low-dimensional manifold rather than providing a
good visualization for the user. More specifically, these methods fail to
represent the data properly on a two-dimensional display, especially when the
inherent dimensionality of the manifold is higher than two.

Preserving the neighborhood structure of the data points in a two-dimensional
visualization is crucial for gaining initial intuition about the data. From an
information retrieval perspective, the fundamental trade-off in a good
visualization is to minimize the number of related points missing from the
neighborhood of each point while preventing unrelated points from appearing in
it. In other words, the goal of a good visualization is to provide a faithful
representation of the data by achieving the best possible balance between
precision and recall. Venna et al. [6] impose the trade-off between precision
and recall by penalizing the violation of each by a different cost and then
obtain the final visualization by minimizing the total cost. Their method,
Neighborhood Retrieval Visualizer (NeRV), achieves different visualizations by
adjusting the trade-off parameter based on the application. It also covers the
Stochastic Neighbor Embedding (SNE) method [7] as the special case that
maximizes the recall in the visualization. However, the trade-off parameter
does not have any statistical interpretation and hence must be set manually by
the user. Additionally, the optimal trade-off between the precision and recall
for a particular dataset is unknown.

In this paper, we consider the problem of finding the visualization of the
data which achieves the optimal trade-off between the precision and recall. For
this, we use a more general class of divergences, called α-divergence, as the
cost function. This choice leads to a wider range of values, including the cost
function of NeRV. For estimating the optimal value of α, we present a
statistical framework based on a recently proposed distribution called
Exponential Divergence with Augmentation (EDA) [8]. EDA is an approximate
generalization of the Tweedie distribution, which has a well-established
relation to β-divergence. With a nonlinear transformation, an equivalence
between β- and α-divergences can be shown, so EDA can also be used for the
estimation of α. As our contributions, we provide a proof that minimizing the
α-divergence is equivalent to maximizing the geometric mean between precision
and recall, parameterized by α. Through the arithmetic-geometric mean
inequality, this also yields an upper bound on the convex-sum trade-off cost
optimized by NeRV. By an extensive set of experiments on different types of
datasets, we also show that the visualization obtained by using the optimal
value of α provides a faithful representation of the original data, attaining
the optimal trade-off between precision and recall.

The organization of the paper is as follows. We first briefly introduce the
information retrieval perspective for data visualization in Section 2, then
provide the motivation for using the α-divergence as the cost function and
explore the characteristics of its gradient. In Section 3, we present our
framework for estimating the optimal value of α for a given data distribution.
We provide our experimental results in Section 4 and finally draw conclusions
in Section 5.

2 Problem Definition 

2.1 Information Retrieval Perspective for Dimensionality Reduction 

Let $\{x_i\} \in \mathbb{R}^n$ denote the high-dimensional representation of
the input points. The binary neighbors of point i in the input space, denoted
by $P_i$, can be defined as the set of points that fall within a fixed radius
or a fixed number of nearest neighbors of $x_i$. From an information retrieval
perspective, the aim of dimensionality reduction is to provide the user with a
low-dimensional representation of the data, i.e., $\{y_i\} \in \mathbb{R}^m$,
in which the neighborhood relation of the points is preserved as much as
possible. In other words, the number of relevant points that are retrieved for
each point is maximized while the number of irrelevant points appearing in the
neighborhood is minimized. These notions can be quantified by considering the
mean precision and mean recall of the embedding. For this purpose, let $Q_i$
denote the binary neighbors of each point in the low-dimensional embedding,
defined in a similar manner. The precision and recall for point i are defined
as the number of points that are common to $P_i$ and $Q_i$ divided by the size
of $Q_i$ and $P_i$, respectively. The mean precision and mean recall of the
embedding are calculated by averaging the precision and recall values over all
points.
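
As a minimal sketch of these definitions, the snippet below computes the mean
precision and mean recall from k-nearest-neighbor sets. The function names
(`knn_sets`, `mean_precision_recall`), the use of NumPy, the neighborhood
sizes, and the toy data are illustrative assumptions and are not taken from
the original implementation.

```python
import numpy as np

def knn_sets(Z, k):
    """Return, for each row of Z, the set of indices of its k nearest neighbors."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    order = np.argsort(d, axis=1)[:, :k]
    return [set(row) for row in order]

def mean_precision_recall(P, Q):
    """P[i]: relevant neighbors (input space); Q[i]: retrieved neighbors (embedding)."""
    prec = [len(p & q) / len(q) for p, q in zip(P, Q)]   # hits / retrieved
    rec = [len(p & q) / len(p) for p, q in zip(P, Q)]    # hits / relevant
    return float(np.mean(prec)), float(np.mean(rec))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))      # toy high-dimensional data (assumption)
Y = X[:, :2]                      # toy "embedding": keep the first two coordinates
P = knn_sets(X, k=10)
Q = knn_sets(Y, k=5)
precision, recall = mean_precision_recall(P, Q)
```

Because the retrieved sets here are half the size of the relevant sets, the
recall of this toy example can never exceed 0.5, which illustrates how the two
measures pull in different directions.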

Venna et al. [6] introduce a generalization of the binary neighborhood
relationship by providing a probabilistic model of neighborhood for each point.
In other words, each point $j \neq i$ has a non-negative probability $q_{ij}$
of being a relevant neighbor of point i in the embedding. This probability can
be obtained by normalizing a non-increasing function of the distance between i
and j in the embedding. A similar probabilistic neighborhood relation $p_{ij}$
can be defined in the input space. Under the probabilistic models of
neighborhood, a natural way to obtain the embedding is to minimize the sum of
Kullback-Leibler (KL) costs between the $p_{ij}$ and $q_{ij}$ values. This is
essentially the cost function adopted in the SNE method [7], that is,

$$C_{\mathrm{SNE}} = \sum_i D_{\mathrm{KL}}(p_i \| q_i) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}. \tag{1}$$
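
The probabilistic neighborhoods and the SNE cost can be sketched as follows.
The text only requires a non-increasing function of distance; the Gaussian
kernel with a single fixed bandwidth `sigma`, the function names, and the
random toy data are assumptions made for illustration.

```python
import numpy as np

def neighbor_probs(Z, sigma=1.0):
    """Row-stochastic matrix: entry (i, j) is the probability that j is a neighbor of i."""
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))   # non-increasing in distance (assumed Gaussian)
    np.fill_diagonal(w, 0.0)               # exclude j == i
    return w / w.sum(axis=1, keepdims=True)

def sne_cost(P, Q, eps=1e-12):
    """Sum over i of KL(p_i || q_i), skipping the zero diagonal."""
    mask = ~np.eye(P.shape[0], dtype=bool)
    return float(np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))     # toy input points
Y = rng.normal(size=(30, 2))     # toy (random) embedding
P = neighbor_probs(X)
Q = neighbor_probs(Y)
cost = sne_cost(P, Q)
```

The cost vanishes when the two neighborhood models coincide and is positive
otherwise, which is the property the embedding optimization exploits.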

It is shown in [6] that $D_{\mathrm{KL}}(p_i \| q_i)$ is a generalization of
recall. Therefore, the SNE method is equivalent to maximizing the recall in the
visualization. On the other hand, $D_{\mathrm{KL}}(q_i \| p_i)$ induces a
generalization of precision. Thus, the NeRV method [6] promotes using a convex
sum of the $D_{\mathrm{KL}}(p_i \| q_i)$ and $D_{\mathrm{KL}}(q_i \| p_i)$
divergences as the cost function,

$$C_{\mathrm{NeRV}} = \lambda \sum_i D_{\mathrm{KL}}(p_i \| q_i) + (1 - \lambda) \sum_i D_{\mathrm{KL}}(q_i \| p_i). \tag{2}$$

The parameter $0 \le \lambda \le 1$ controls the trade-off between maximizing
the generalized recall (λ = 1) and maximizing the generalized precision (λ = 0)
in the embedding. Under the binary neighborhood assumption, these become
equivalent to maximizing recall and precision, respectively.

While imposing a balance between precision and recall in a visualization is
crucial, the NeRV method incurs the extra cost of setting the parameter λ to a
reasonable value. The user might try several different values of λ and assess
the results manually. However, there is no systematic way to estimate the
optimal value from the data. To overcome this problem, we introduce in the
following section a more general class of divergences, called α-divergence,
which includes the NeRV cost functions for λ = 0 and λ = 1 as special cases.

2.2 α-Divergence for Stochastic Neighbor Embedding

The (asymmetric) α-divergence over discrete distributions is defined by

$$D_\alpha(p \| q) = \frac{\sum_i \left[ p_i^{\alpha} q_i^{1-\alpha} - \alpha p_i + (\alpha - 1) q_i \right]}{\alpha(\alpha - 1)}. \tag{3}$$

As special cases, it includes many well-known divergences, such as
$D_{\mathrm{KL}}(q \| p)$ and $D_{\mathrm{KL}}(p \| q)$, which are obtained in
the limits α → 0 and α → 1, respectively [9]. We are mainly interested in the
interval α ∈ [0,1]. The points α = 0 and α = 1 amount to the cost function of
NeRV when λ = 0 and λ = 1, respectively. More generally, when α varies from 0
to 1, the α-divergence passes smoothly through all values of the NeRV cost
function for λ ∈ (0,1), since the divergence itself is a continuous function of
α. Note that the mapping from λ to α is not onto; this can easily be seen by
taking two arbitrary distributions, varying λ and α, and comparing the values
of the two cost functions. Thus, the α-divergence covers an even wider range of
values than the convex sum of $D_{\mathrm{KL}}(p \| q)$ and
$D_{\mathrm{KL}}(q \| p)$.
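
The limiting behavior described above can be checked numerically. The sketch
below implements the α-divergence of (3) for strictly positive vectors and
compares it with the two KL divergences near the end points; the function
names and the two example distributions are arbitrary choices made for
illustration.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def alpha_div(p, q, alpha):
    """Asymmetric alpha-divergence of Eq. (3); the end points use the KL limits."""
    if alpha == 0.0:
        return kl(q, p)            # limit alpha -> 0
    if alpha == 1.0:
        return kl(p, q)            # limit alpha -> 1
    num = p ** alpha * q ** (1.0 - alpha) - alpha * p + (alpha - 1.0) * q
    return float(np.sum(num) / (alpha * (alpha - 1.0)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])
near_zero = alpha_div(p, q, 1e-6)          # should approach KL(q || p)
near_one = alpha_div(p, q, 1.0 - 1e-6)     # should approach KL(p || q)
midpoint = alpha_div(p, q, 0.5)
nerv_mid = 0.5 * kl(p, q) + 0.5 * kl(q, p) # NeRV cost with lambda = 0.5
```

For this pair of distributions the value at α = 0.5 differs from the NeRV cost
at λ = 0.5, so the two families agree at the end points but are not related by
the identity map in between.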

As an important result, minimizing the α-divergence amounts to maximizing a
geometric mean of precision and recall, parameterized by α. As an immediate
result of the arithmetic-geometric mean inequality, this geometric mean is
bounded above by the arithmetic mean with the same trade-off parameter α = λ;
equivalently, the α-divergence cost is an upper bound on the corresponding
convex-sum cost. The proof is in Appendix A.

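The aforementioned properties motivate using a sum of α-divergences over all
pairs of distributions as the new cost function. We call the method α-SNE, for
Stochastic Neighbor Embedding with α-divergence. The new cost function to
minimize becomes

$$C_{\alpha\text{-SNE}} = \sum_i D_\alpha(p_i \| q_i) =
\begin{cases}
\dfrac{\sum_i \sum_j \left[ p_{ij}^{\alpha} q_{ij}^{1-\alpha} - \alpha p_{ij} + (\alpha - 1) q_{ij} \right]}{\alpha(\alpha - 1)}, & \alpha \neq 0, 1, \\[2ex]
\sum_i \sum_j q_{ij} \log (q_{ij} / p_{ij}), & \alpha = 0, \\[1ex]
\sum_i \sum_j p_{ij} \log (p_{ij} / q_{ij}), & \alpha = 1.
\end{cases} \tag{4}$$
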
2.3 Notes on the Gradient and Practical Matters on Optimization 

More interesting properties are revealed by considering the gradient^1,

$$\frac{\partial C_{\alpha\text{-SNE}}}{\partial y_i} = \frac{2}{\alpha} \sum_{j \neq i} (y_i - y_j) \left( p_{ij}^{\alpha} q_{ij}^{1-\alpha} - \theta_i q_{ij} + p_{ji}^{\alpha} q_{ji}^{1-\alpha} - \theta_j q_{ji} \right), \tag{5}$$

in which $0 < \theta_i = \sum_{j \neq i} p_{ij}^{\alpha} q_{ij}^{1-\alpha} \le 1$;
we call it the compatibility factor for point i, with the following properties:
$\theta_i = 1$ if $p_i = q_i$, and $\theta_i < 1$ otherwise for α ∈ (0,1]
(except α = 1, where $\theta_i = \sum_{j \neq i} p_{ij} = 1$). The gradient has
an interpretation of springs between map points with stiffness proportional to
the mismatch in the probability distributions, similar to SNE,

$$\frac{\partial C_{\mathrm{SNE}}}{\partial y_i} = 2 \sum_{j \neq i} (y_i - y_j) \left( p_{ij} - q_{ij} + p_{ji} - q_{ji} \right). \tag{6}$$

However, comparing the gradient (5) with the gradient of SNE, it can be seen
that the attraction terms $p_{ij}$ and $p_{ji}$ are replaced by
$p_{ij}^{\alpha} q_{ij}^{1-\alpha}$ and $p_{ji}^{\alpha} q_{ji}^{1-\alpha}$,
respectively. On the other hand, the repulsion terms $q_{ij}$ and $q_{ji}$ are
weighted by the compatibility factors for points i and j, respectively. The
compatibility factor for point i can therefore also be seen as the sum of the
attraction terms between i and the rest of the points. Finally, the whole
gradient is scaled by a factor of 1/α. If α = 1, then (5) and (6) agree.

^1 The gradient for the case α = 0 can be obtained in the limit α → 0, where we
have

$$\frac{\partial C_{\alpha\text{-SNE}}}{\partial y_i} = 2 \sum_{j \neq i} (y_i - y_j) \left( q_{ij} D_{\mathrm{KL}}(q_i \| p_i) - q_{ij} \log \frac{q_{ij}}{p_{ij}} + q_{ji} D_{\mathrm{KL}}(q_j \| p_j) - q_{ji} \log \frac{q_{ji}}{p_{ji}} \right).$$

Fig. 1. Gradients of (a) SNE and (b) α-SNE with α = 0.8 as a function of the
pairwise Euclidean distances between two points in the high-dimensional space
(horizontal axis) and in the low-dimensional image. α-SNE produces more
balanced gradients compared to SNE. Note the different color scales.

Figures 1(a) and 1(b) show the gradients of SNE and α-SNE (with α = 0.8),
respectively, for a pair of points, as a function of their Euclidean distances
in the high-dimensional space and the low-dimensional space. Positive values
represent attraction, while negative values correspond to repulsion. As can be
seen in Figure 1(a), SNE exerts a large attraction force on moderately close
datapoints which are mapped far from each other. However, the repulsion force
is comparatively small for the opposite case (around 19 to 1). These properties
are consistent with the fact that the SNE method maximizes the recall by
collapsing the neighboring relevant points together. On the other hand, α-SNE
(Figure 1(b)) results in a more balanced gradient, compared to SNE, by damping
the large attraction forces and, further, by strongly repelling dissimilar
datapoints which are mapped close together. Thus, α-SNE yields a more balanced
trade-off between precision and recall, compared to SNE.
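
The gradient of the α-SNE cost with compatibility factors can be checked
against a finite-difference estimate. The sketch below assumes embedding
probabilities of the form $q_{ij} \propto \exp(-\|y_i - y_j\|^2)$ (unit
bandwidth), uses the same kernel to build toy input probabilities, and
introduces the helper names `q_matrix`, `cost` and `grad` purely for
illustration.

```python
import numpy as np

def q_matrix(Y):
    """Row-normalized Gaussian neighbor probabilities (unit bandwidth assumed)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2)
    np.fill_diagonal(w, 0.0)
    return w / w.sum(axis=1, keepdims=True)

def cost(P, Y, alpha):
    """Sum of alpha-divergences D_alpha(p_i || q_i), for alpha not in {0, 1}."""
    Q = q_matrix(Y)
    m = ~np.eye(len(Y), dtype=bool)
    num = P[m] ** alpha * Q[m] ** (1 - alpha) - alpha * P[m] + (alpha - 1) * Q[m]
    return float(np.sum(num) / (alpha * (alpha - 1)))

def grad(P, Y, alpha):
    """Gradient with attraction terms p^alpha q^(1-alpha) and repulsion theta * q."""
    Q = q_matrix(Y)
    m = ~np.eye(len(Y), dtype=bool)
    A = np.zeros_like(P)
    A[m] = P[m] ** alpha * Q[m] ** (1 - alpha)   # attraction terms
    theta = A.sum(axis=1, keepdims=True)         # compatibility factors
    S = A - theta * Q                            # mismatch for the pair (i, j)
    S = S + S.T                                  # add the (j, i) contribution
    diff = Y[:, None, :] - Y[None, :, :]
    return (2.0 / alpha) * np.sum(S[:, :, None] * diff, axis=1)

rng = np.random.default_rng(0)
P = q_matrix(rng.normal(size=(8, 3)))   # toy input-space probabilities
Y = rng.normal(size=(8, 2))             # toy embedding
alpha, h = 0.8, 1e-6
G = grad(P, Y, alpha)
E = np.zeros_like(Y)
E[3, 1] = h                             # perturb one coordinate of one point
fd = (cost(P, Y + E, alpha) - cost(P, Y - E, alpha)) / (2 * h)
```

Here `fd` is the central-difference estimate of the derivative with respect to
the second coordinate of the fourth point and should agree with `G[3, 1]`.
Calling `grad` with `alpha = 1.0` gives the SNE gradient, since the attraction
terms reduce to $p_{ij}$ and every compatibility factor equals one.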

As a final remark, the optimization of the cost function (4), for a given α,
can be achieved using standard methods, e.g., steepest descent. Jitter noise
with a constant variance can be added in the early stages to mimic simulated
annealing.
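
A minimal sketch of such an optimizer is given below. The step size, the
number of iterations, the jitter level, and the name `descend` are all
hypothetical choices; the loop is demonstrated on a toy quadratic objective
rather than on the α-SNE cost itself.

```python
import numpy as np

def descend(grad_fn, y0, lr=0.1, n_iter=500, jitter_std=0.05, jitter_iters=100, seed=0):
    """Steepest descent; Gaussian jitter of constant variance is added only early on."""
    rng = np.random.default_rng(seed)
    y = np.array(y0, dtype=float)
    for t in range(n_iter):
        y = y - lr * grad_fn(y)
        if t < jitter_iters:
            # constant-variance noise helps escape poor initial configurations
            y = y + rng.normal(scale=jitter_std, size=y.shape)
    return y

# Toy objective 0.5 * ||y - target||^2, whose gradient is y - target.
target = np.array([[1.0, -2.0], [0.5, 3.0]])
y_final = descend(lambda y: y - target, np.zeros((2, 2)))
```

In practice, `grad_fn` would be replaced by a function returning the gradient
of the cost with respect to all map points, and the jitter would be switched
off once the layout has roughly settled.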

3 α-Optimization

After defining the cost function of α-SNE and obtaining a method to optimize
it, there remains the problem of selecting the α value for a particular dataset
that results in the optimal trade-off between precision and recall.

The optimization of the α parameter is performed using Exponential Divergence
with Augmentation (EDA) [8], a distribution proposed initially for maximum
likelihood estimation of β in the β-divergence $D_\beta(u \| \mu)$, where u and
μ are any positive vectors such as probability distributions. Typically, u is
known and μ is a parametric approximation. Once the optimal β is found, the
optimal α is obtained by a simple transformation. EDA is an approximation to
the Tweedie distribution $p_{\mathrm{Tw}}(u; \mu, \phi, \beta)$, which is
related to the β-divergence in such a way that the $\mu^*$ that maximizes the
likelihood of $p_{\mathrm{Tw}}$ also minimizes $D_\beta$. While the
β-divergence does not provide a means to optimize β directly, our basic idea is
that the likelihood of β stemming from $p_{\mathrm{Tw}}$ can be maximized for
that purpose. Note that both α- and β-divergences are separable, and so we can
operate component-wise.
The pdf of the one-variate Tweedie density of $u_i$ is given as

$$p_{\mathrm{Tw}}(u_i; \mu_i, \phi, \beta) = f(u_i, \phi, \beta) \exp\left\{ \frac{1}{\phi} \left[ \frac{u_i \mu_i^{\beta}}{\beta} - \frac{\mu_i^{\beta+1}}{\beta+1} \right] \right\} \tag{7}$$

for $\beta \neq -1, 0$, where φ > 0 is the dispersion parameter and
$f(u_i, \phi, \beta)$ is a weighting function whose analytical form is
generally not available and has to be approximated. However, there are some
shortcomings associated with the Tweedie likelihood: in particular, the pdf
does not exist for β ∈ (0,1), and the approximation of $f(u_i, \phi, \beta)$ is
not well studied for β > 1. The EDA density is proposed to overcome these
issues, while being a close approximation to the Tweedie distribution. Using
the relation with the β-divergence and Laplace's method, its pdf is found to be
of the form [8]

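$$p_{\mathrm{EDA}}(u_i; \mu_i, \phi, \beta) = \frac{1}{Z_{\beta,\phi}} \exp\left\{ \frac{\beta - 1}{2} \ln u_i - \frac{1}{\phi} D_\beta(u_i \| \mu_i) \right\}, \tag{8}$$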


where $Z_{\beta,\phi}$ is a normalizing constant and $D_\beta(u_i \| \mu_i)$
appears in the exponent. Evaluation of $Z_{\beta,\phi}$ requires an integration
in one dimension. Although it is not available analytically in general^2, it
can be evaluated numerically using standard statistical software. The
parameters β and φ can be optimized either by maximizing the likelihood or by
using methods for parameter estimation in non-normalized densities, such as
Score Matching (SM) [10]. Both of these methods have been successfully used to
find optimal β values [8].

^2 In fact, $Z_{\beta,\phi}$ is analytically available for
$\beta = 1, 0, -1, -2$, which correspond to the Gaussian, Poisson, Gamma and
Inverse Gaussian distributions, respectively, which are also special cases of
the Tweedie distribution.
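
The one-dimensional integration needed for the normalizing constant can be
sketched with a simple trapezoidal rule. The density form used here, with
exponent $\frac{\beta-1}{2} \ln u - D_\beta(u \| \mu)/\phi$, is our reading of
the EDA density and should be checked against [8]; the integration grid, the
shared μ across observations, and all function names are assumptions made for
illustration.

```python
import numpy as np

def beta_div(u, mu, beta):
    """Scalar beta-divergence D_beta(u || mu) for beta not in {0, -1}."""
    return (u ** (beta + 1) + beta * mu ** (beta + 1)
            - (beta + 1) * u * mu ** beta) / (beta * (beta + 1))

def eda_log_unnorm(u, mu, phi, beta):
    """Log of the unnormalized density (assumed form, see the lead-in)."""
    return 0.5 * (beta - 1.0) * np.log(u) - beta_div(u, mu, beta) / phi

def eda_log_z(mu, phi, beta, upper=50.0, n=200001):
    """Log normalizing constant via the trapezoidal rule on a grid over (0, upper]."""
    grid = np.linspace(1e-8, upper, n)
    f = np.exp(eda_log_unnorm(grid, mu, phi, beta))
    return float(np.log(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid))))

def eda_loglik(u, mu, phi, beta):
    """Log-likelihood of observations u that share the same mu, phi and beta."""
    u = np.asarray(u, dtype=float)
    return float(np.sum(eda_log_unnorm(u, mu, phi, beta))
                 - u.size * eda_log_z(mu, phi, beta))

# For beta = 1 the exponent is -(u - mu)^2 / (2 * phi), i.e. a Gaussian kernel.
log_z = eda_log_z(mu=5.0, phi=0.5, beta=1.0)
```

With β = 1 the computed constant can be compared with the Gaussian value
$\sqrt{2 \pi \phi}$, which gives a quick sanity check of the grid before the
likelihood is evaluated over a range of β values.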

Dataset          Size   PCA    LLE    LEM    Isomap  SNE    t-SNE  HSSNE  NeRV   α-SNE  α-SNE (EDA)
Iris              150   0.85   0.70   0.85   0.65    0.88   0.86   0.83   0.89   0.90   0.88
Wine              178   0.50   0.49   0.51   0.47    0.64   0.69   0.68   0.69   0.72   0.69
Image Segs*       210   0.40   0.27   0.27   0.58    0.74   0.83   0.77   0.80   0.84   0.80
Glass             214   0.50   0.50   0.50   0.53    0.71   0.73   0.68   0.73   0.75   0.74
Leaf              340   0.63   0.71   0.68   0.60    0.71   0.72   0.66   0.74   0.76   0.74
Olivetti Faces    400   0.26   0.24   0.28   0.25    0.35   0.46   0.45   0.44   0.46   0.44
UMist Faces       565   0.36   0.32   0.38   0.49    0.51   0.72   0.65   0.70   0.75   0.75
Vehicle           846   0.26   0.25   0.30   0.29    0.47   0.61   0.57   0.58   0.63   0.63
USPS Digits*     1000   0.08   0.06   0.10   0.16    0.21   0.38   0.36   0.34   0.40   0.39
COIL20           1440   0.28   0.30   0.30   0.21    0.62   0.79   0.74   0.77   0.80   0.79
MIT Scene        2686   0.05   0.04   0.05   0.07    0.09   0.21   0.22   0.19   0.22   0.22
Texture*         2986   0.27   0.19   0.28   0.14    0.49   0.60   0.55   0.57   0.62   0.61
MNIST*           6000   0.27   0.16   0.26   0.09    0.11   0.34   0.32   0.26   0.35   0.35

* Only a subset of the original dataset is used.

Table 1. Area under ROC curve (AUC) for different methods.


It is possible to use EDA to optimize α, too, using the relation between α- and
β-divergences. Note that both divergences are separable and we can formulate
the relation using just scalars. We have

$$D_\beta(u_i \| \mu_i) = D_\alpha(v_i \| m_i), \tag{9}$$

with the nonlinear transformation $u_i = v_i^{\alpha} / \alpha^{2\alpha}$,
$\mu_i = m_i^{\alpha} / \alpha^{2\alpha}$ and $\beta = 1/\alpha - 1$ for
α ≠ 0. This relationship allows us to evaluate the likelihood of $m_i$ and α
using $u_i$ and β:

$$p(v_i; m_i, \alpha, \phi) = p_{\mathrm{EDA}}(u_i; \mu_i, \phi, \beta)\, u_i^{-\beta}\, |\beta + 1|. \tag{10}$$

α can be optimized (alongside φ) by maximizing its likelihood given by
$p(v_i; m_i, \alpha, \phi)$ or by minimizing the SM objective function
evaluated from it. It is more convenient to treat $m_i$ as constant, fixed to
the value which minimizes the α-divergence. It is also possible to optimize it
using EDA.
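
The equivalence between the two divergences and the Jacobian factor of the
change of variables can be verified numerically. In the sketch below the
scaling $\alpha^{2\alpha}$ in the transformation is the one that makes the two
scalar divergences agree exactly; the function names, the value α = 0.7, and
the random test values are illustrative assumptions.

```python
import numpy as np

def alpha_div(v, m, alpha):
    """Elementwise alpha-divergence terms for positive scalars v, m."""
    num = v ** alpha * m ** (1 - alpha) - alpha * v + (alpha - 1) * m
    return num / (alpha * (alpha - 1))

def beta_div(u, mu, beta):
    """Elementwise beta-divergence terms for positive scalars u, mu."""
    num = u ** (beta + 1) + beta * mu ** (beta + 1) - (beta + 1) * u * mu ** beta
    return num / (beta * (beta + 1))

def to_beta(v, m, alpha):
    """Map (v, m, alpha) to (u, mu, beta) via the nonlinear transformation."""
    scale = alpha ** (2 * alpha)
    return v ** alpha / scale, m ** alpha / scale, 1.0 / alpha - 1.0

rng = np.random.default_rng(0)
v = rng.uniform(0.01, 1.0, size=1000)
m = rng.uniform(0.01, 1.0, size=1000)
alpha = 0.7
u, mu, beta = to_beta(v, m, alpha)
gap = np.max(np.abs(beta_div(u, mu, beta) - alpha_div(v, m, alpha)))

# Jacobian |du/dv| by central differences, compared with u^(-beta) * |beta + 1|.
h = 1e-7
jac_fd = (to_beta(v + h, m, alpha)[0] - to_beta(v - h, m, alpha)[0]) / (2 * h)
jac_gap = np.max(np.abs(jac_fd / (u ** (-beta) * abs(beta + 1)) - 1.0))
```

Both `gap` and `jac_gap` should be close to zero, which confirms that a
likelihood written in terms of $(u_i, \mu_i, \beta)$ can be re-expressed in
terms of $(v_i, m_i, \alpha)$ by multiplying with the factor
$u_i^{-\beta} |\beta + 1|$.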

To solve our original problem, we fix the $v_i$ values in (10) to the
vectorized form of the matrix P, which contains the probabilities $p_i$ of
α-SNE in each column. We set $m_i$ to the vectorized form of the matrix Q,
which is formed in a similar manner for the map points. We then compute the
values $u_i$ and $\mu_i$ from the above transformation, optimize jointly over
(α, φ) by minimizing the score matching objective function of the unnormalized
EDA density in (10), and select the best α value.


4 Experiments 

In this section, we provide a set of experiments on different real-world
datasets to assess the performance of α-SNE compared to other dimensionality
reduction methods. We consider the following methods for comparison: Principal
Component Analysis (PCA), Locally Linear Embedding (LLE), Isomap, Laplacian
Eigenmaps (LEM), SNE, t-distributed SNE (t-SNE), Heavy-tailed Symmetric SNE
(HSSNE), and NeRV. Our implementation of α-SNE is in MATLAB, and the code along
with the description of the datasets used is available online^3. For the other
methods, we use publicly available implementations.

Fig. 2. Area under ROC curve (first row), and log-likelihood obtained using EDA
on (a) UMist Faces, (b) Vehicle, (c) USPS Digits, and (d) Texture datasets.

As the goodness measure, we consider the area under the receiver operating
characteristic (ROC) curve (AUC). In this way, we can combine the mean
precision and mean recall into a single value which can easily be used to
compare the performance of different approaches. To calculate AUC, we fix the
neighborhood size in the input space to 20 nearest neighbors and vary the
number of neighbors in the output space from 1 to 100 to calculate precision
and recall. Finally, we compute the area under the resulting ROC curve. For
each dataset, we repeat the experiments 20 times with different random
initializations and report the averages over all the trials. For LLE, Isomap,
and LEM, we consider a wide range of parameters and report the maximum AUC
obtained. We set the tail-heaviness parameter in HSSNE equal to 2. For NeRV, we
perform a linear search over different values of λ and select the one which
results in the maximum AUC. For our method, we consider two approaches. As the
greedy approach, we perform a linear search over different values of α and
report the maximum AUC value obtained. Moreover, we apply the optimization
framework in Section 3 and report the AUC value obtained using the optimal
value of α. By this, we can also assess the performance of EDA for finding the
optimal α for information retrieval.
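
The evaluation protocol can be sketched as follows. The text does not spell
out how the curve is normalized, so this sketch makes its own assumptions: it
traces the mean true-positive rate (recall) against the mean false-positive
rate while the output neighborhood grows, and integrates with the trapezoidal
rule. Values from it are therefore not directly comparable to Table 1, and the
function names are hypothetical.

```python
import numpy as np

def rank_neighbors(Z):
    """Indices of all other points, sorted by increasing distance, for each row of Z."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :-1]     # the point itself is sorted last

def roc_area(X, Y, k_in=20, k_out_max=100):
    """Area under the curve traced by varying the output neighborhood size."""
    n = X.shape[0]
    rx, ry = rank_neighbors(X), rank_neighbors(Y)
    relevant = np.zeros((n, n), dtype=bool)
    np.put_along_axis(relevant, rx[:, :k_in], True, axis=1)
    # hits[i, k-1]: relevant points among the k nearest embedded neighbors of i
    hits = np.cumsum(np.take_along_axis(relevant, ry, axis=1), axis=1)[:, :k_out_max]
    ks = np.arange(1, hits.shape[1] + 1)
    tpr = np.mean(hits / k_in, axis=0)                     # mean recall
    fpr = np.mean((ks - hits) / (n - 1 - k_in), axis=0)    # mean false-positive rate
    precision = np.mean(hits / ks, axis=0)                 # mean precision
    tpr0, fpr0 = np.r_[0.0, tpr], np.r_[0.0, fpr]
    area = float(np.sum(0.5 * (tpr0[1:] + tpr0[:-1]) * np.diff(fpr0)))
    return area, precision, tpr

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy data (assumption)
area, precision, recall = roc_area(X, X[:, :2])
```

Averaging the returned area over several random initializations of the
embedding would mirror the repeated-trial protocol described above.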


^3 https://github.com/eamid/alpha-SNE












Optimizing the Information Retrieval Trade-off 




Fig. 3. Visualization of the Texture dataset using different dimensionality
reduction methods: (a) PCA, (b) LLE, (c) LEM, (d) Isomap, (e) SNE, (f) t-SNE,
(g) HSSNE, and (h) α-SNE.


The results are shown in Table 1. As can be seen from the table, the α-SNE
method attains the highest AUC on all datasets, tied with t-SNE on Olivetti
Faces and with HSSNE on MIT Scene. Additionally, the value obtained by EDA is
exactly the same as or very close to the maximum value obtained by linear
search. Note that the estimation of the optimal α using EDA becomes more
accurate as the size of the dataset increases. This is a result of having a
more accurate estimate of the data distribution as more data samples become
available. Figure 2 illustrates examples of AUC and log-likelihood curves on
different datasets, obtained by varying the α parameter. The maximum likelihood
estimate of α using EDA and the α value that maximizes the AUC, i.e., that
yields the optimal trade-off between precision and recall, coincide in most
cases or, at least, are located very close together. This supports our claim
that the optimal value of α obtained by EDA is the one that yields a
visualization that represents the data best.

Finally, Figure 3 shows the visualization of the Texture dataset using
different dimensionality reduction methods. As can be seen, the clusters are
well separated only by the t-SNE, HSSNE, and α-SNE methods. However, the result
of t-SNE is somewhat misleading, as the clusters are excessively
over-separated. This can be seen by increasing the tail-heaviness parameter
from 1 in t-SNE to 2 in HSSNE, where the clusters are even more squeezed. This
is in contrast with reality, where the clusters are distinguishable but located
fairly close to each other. The result obtained by α-SNE illustrates this
property by maximizing the AUC of the visualization. As a final remark, the
visualization obtained from NeRV is visually similar to the one obtained from
α-SNE but yields a lower AUC value.

5 Conclusions and Future Work

We proposed the α-SNE method to obtain a faithful representation of the data on
a two-dimensional screen. We showed that the minimization of the cost function
corresponds to maximizing the geometric mean of the precision and recall of the
visualization. We also proposed a statistical framework to estimate the optimal
α parameter for the visualization, purely based on the data. By an extensive
set of experiments, we showed that our proposed method performs at least as
well as the previous dimensionality reduction methods in terms of the quality
of the visualization, as measured by AUC. Additionally, the experiments support
our claim that the optimal value of α under the EDA distribution yields the
optimal trade-off between precision and recall.

As future work, we would like to extend our method to incorporate heavy-tailed
distributions in the low-dimensional space. The generalization of the EDA
optimization to a joint distribution over α and the tail-heaviness parameter is
also left to future work.

Appendix A. Proof of the Connection Between α-SNE
Cost Function and Precision and Recall

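We consider the binary neighborhood model in both the input space and the
low-dimensional embedding, similar to [6]. By binary neighborhood, we assume
that each point has a fixed number of equally relevant neighbors in both the
input and the embedding. Under this model, the probabilistic neighborhood
models in the input space and the embedding are defined as

$$p_{ij} = \begin{cases} a_i \equiv \dfrac{1 - \delta}{r_i}, & j \in P_i, \\[1.5ex] b_i \equiv \dfrac{\delta}{N - r_i - 1}, & \text{otherwise}, \end{cases}
\qquad
q_{ij} = \begin{cases} c_i \equiv \dfrac{1 - \delta}{k_i}, & j \in Q_i, \\[1.5ex] d_i \equiv \dfrac{\delta}{N - k_i - 1}, & \text{otherwise}, \end{cases} \tag{11}$$

where N is the number of points, $r_i = |P_i|$ and $k_i = |Q_i|$ are the
neighborhood sizes, and $0 < \delta \ll 0.5$ is a small constant assigning a
small portion of the probability mass to irrelevant points.

We now show that minimizing the α-divergence between the probabilities in the
input space and the embedding is equivalent to maximizing the geometric mean of
precision and recall. We consider the case 0 < α < 1. Similar results hold for
α = 0, 1 (see [6]). Using the fact that $\sum_j p_{ij} = \sum_j q_{ij} = 1$ and
ignoring the constant factors, the α-divergence cost function to be minimized
for point i becomes

$$1 - \sum_{j \in P_i,\, j \in Q_i} a_i^{\alpha} c_i^{1-\alpha} - \sum_{j \in P_i,\, j \notin Q_i} a_i^{\alpha} d_i^{1-\alpha} - \sum_{j \notin P_i,\, j \in Q_i} b_i^{\alpha} c_i^{1-\alpha} - \sum_{j \notin P_i,\, j \notin Q_i} b_i^{\alpha} d_i^{1-\alpha}$$
$$= 1 - \left( a_i^{\alpha} c_i^{1-\alpha} \right) N_{\mathrm{TP}} - \left( a_i^{\alpha} d_i^{1-\alpha} \right) N_{\mathrm{MISS}} - \left( b_i^{\alpha} c_i^{1-\alpha} \right) N_{\mathrm{FP}} - \left( b_i^{\alpha} d_i^{1-\alpha} \right) N_{\mathrm{TN}}, \tag{12}$$

where $N_{\mathrm{TP}}$, $N_{\mathrm{MISS}}$, $N_{\mathrm{FP}}$ and
$N_{\mathrm{TN}}$ denote the numbers of true positives, misses, false positives
and true negatives for point i.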
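
Substituting the values in the cost function gives

$$1 - \frac{1 - \delta}{r_i^{\alpha} k_i^{1-\alpha}} N_{\mathrm{TP}} - \frac{(1 - \delta)^{\alpha} \delta^{1-\alpha}}{r_i^{\alpha} (N - k_i - 1)^{1-\alpha}} N_{\mathrm{MISS}} - \frac{\delta^{\alpha} (1 - \delta)^{1-\alpha}}{(N - r_i - 1)^{\alpha} k_i^{1-\alpha}} N_{\mathrm{FP}} - \frac{\delta}{(N - r_i - 1)^{\alpha} (N - k_i - 1)^{1-\alpha}} N_{\mathrm{TN}}. \tag{13}$$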
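
The count-based expansion can be checked against a direct evaluation of
$1 - \sum_j p_{ij}^{\alpha} q_{ij}^{1-\alpha}$ for a single point under the
binary neighborhood model. The neighbor sets, the values of N, δ and α, and the
helper `binary_probs` are toy assumptions chosen for illustration.

```python
import numpy as np

def binary_probs(neigh, n, i, delta):
    """Binary neighborhood model for point i: mass 1 - delta on `neigh`, delta elsewhere."""
    size = len(neigh)
    p = np.full(n, delta / (n - size - 1))
    p[list(neigh)] = (1.0 - delta) / size
    p[i] = 0.0                               # a point is not its own neighbor
    return p

n, i, delta, alpha = 40, 0, 1e-6, 0.6
P_i = set(range(1, 11))          # r_i = 10 relevant neighbors (toy choice)
Q_i = set(range(6, 14))          # k_i = 8 retrieved neighbors (toy choice)
r, k = len(P_i), len(Q_i)
p, q = binary_probs(P_i, n, i, delta), binary_probs(Q_i, n, i, delta)

n_tp = len(P_i & Q_i)            # true positives
n_miss = len(P_i - Q_i)          # relevant but not retrieved
n_fp = len(Q_i - P_i)            # retrieved but not relevant
n_tn = n - 1 - n_tp - n_miss - n_fp

direct = 1.0 - np.sum(p ** alpha * q ** (1 - alpha))
by_counts = 1.0 - (
    (1 - delta) / (r ** alpha * k ** (1 - alpha)) * n_tp
    + (1 - delta) ** alpha * delta ** (1 - alpha)
    / (r ** alpha * (n - k - 1) ** (1 - alpha)) * n_miss
    + delta ** alpha * (1 - delta) ** (1 - alpha)
    / ((n - r - 1) ** alpha * k ** (1 - alpha)) * n_fp
    + delta / ((n - r - 1) ** alpha * (n - k - 1) ** (1 - alpha)) * n_tn
)
pr_alpha = n_tp / (r ** alpha * k ** (1 - alpha))   # leading term as delta -> 0
```

For small δ the quantity `1 - direct` is dominated by the true-positive term,
so it stays close to `pr_alpha`, which previews the limiting argument that
follows.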


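For small values of δ, the first two terms dominate and the remaining terms
become negligible. Therefore, in the limit δ → 0, minimizing the cost in (13)
becomes equivalent to maximizing the term

$$\mathrm{PR}(\alpha) = \frac{N_{\mathrm{TP}}}{r_i^{\alpha} k_i^{1-\alpha}} = \left( \frac{N_{\mathrm{TP}}}{r_i} \right)^{\alpha} \left( \frac{N_{\mathrm{TP}}}{k_i} \right)^{1-\alpha},$$

which is the geometric mean of recall, $N_{\mathrm{TP}} / r_i$, and precision,
$N_{\mathrm{TP}} / k_i$, parameterized by α. The result is also consistent at
the end points α = 0 and α = 1, where $\mathrm{PR}(0) = N_{\mathrm{TP}} / k_i$
and $\mathrm{PR}(1) = N_{\mathrm{TP}} / r_i$ reduce to precision and recall,
respectively. As a direct result of the weighted arithmetic-geometric mean
inequality, for any α = λ, PR(α) is bounded by the convex sum of recall and
precision, that is,

$$\frac{N_{\mathrm{TP}}}{r_i^{\alpha} k_i^{1-\alpha}} \le \alpha \frac{N_{\mathrm{TP}}}{r_i} + (1 - \alpha) \frac{N_{\mathrm{TP}}}{k_i}, \tag{14}$$

so that the α-divergence cost $1 - \mathrm{PR}(\alpha)$ is an upper bound on
the convex sum of the recall and precision errors,
$\alpha (1 - N_{\mathrm{TP}} / r_i) + (1 - \alpha)(1 - N_{\mathrm{TP}} / k_i)$.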
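
The weighted arithmetic-geometric mean inequality,
$x^{\alpha} y^{1-\alpha} \le \alpha x + (1 - \alpha) y$ for positive x, y and
α ∈ [0,1], can be checked numerically on random values. In the sketch below the
arrays `recall` and `precision` simply stand in for $N_{\mathrm{TP}} / r_i$ and
$N_{\mathrm{TP}} / k_i$; they are random numbers, not results from any
experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
recall = rng.uniform(0.0, 1.0, size=10000)       # stands in for N_TP / r_i
precision = rng.uniform(0.0, 1.0, size=10000)    # stands in for N_TP / k_i
alpha = rng.uniform(0.0, 1.0, size=10000)

geometric = recall ** alpha * precision ** (1 - alpha)    # PR(alpha)
arithmetic = alpha * recall + (1 - alpha) * precision     # convex sum, lambda = alpha
max_violation = float(np.max(geometric - arithmetic))

# Single worked case: recall 0.5, precision 0.25, alpha 0.5.
geo_example = 0.5 ** 0.5 * 0.25 ** 0.5
ari_example = 0.5 * 0.5 + 0.5 * 0.25
```

In the worked case the geometric mean is about 0.354 against an arithmetic
mean of 0.375; the two coincide only when precision and recall are equal or
when α is 0 or 1.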